Lecture 5

This lecture, taught by Prof. Cathy Yi-Hsuan Chen, focuses on textual analysis. It introduces NLP (Natural Language Processing) techniques, including tokenization, negation handling, part-of-speech (PoS) tagging, and lemmatization.

Specifically, the code can be found on GitHub.

Some useful online courses are offered by Stanford University.

Outline



What is NLP

  • With NLP, we can generate "machine-readable" text that can be further analyzed by artificial intelligence or machine learning.
  • With NLP, we build an interaction between computers and humans using natural language.
  • Most NLP techniques rely on machine learning to derive meaning from human languages.
  • NLP is considered a difficult problem in computer science; it is the nature of human language that makes NLP difficult.

How NLP works

  • NLP identifies and extracts natural language rules so that unstructured text is converted into a form that computers can understand.
  • Given the text provided, the computer utilizes algorithms to extract the meaning associated with every sentence and collect the essential data from it.
  • The computer may fail to understand the meaning of a sentence correctly, leading to ambiguous results.

Zooming in on NLP

  • Text is unstructured data with implicit structure
    • Text, sentences, words, characters
    • Nouns, verbs, adjectives
    • Grammar
  • Transform implicit text structure into explicit structure
  • Reduce text variation for further analysis
  • Python's Natural Language Toolkit (NLTK)

NLTK

The Natural Language Toolkit is an open-source library for the Python programming language. It contains text-processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning. It also includes graphical demonstrations and sample data sets, and is accompanied by a cookbook and a book explaining the principles behind the underlying language-processing tasks that NLTK supports.

Tokenize and tag text

  • Decompose a string into sentences
  • Decompose a sentence into words/tokens (NLTK can do both steps directly; see the sketch below)
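A minimal NLTK sketch of both steps (assuming the punkt tokenizer model has been downloaded); the code that follows instead builds the same pipeline by hand with plain string operations:

import nltk
# nltk.download('punkt')  # one-time download of the sentence tokenizer model

text = "Shall I compare thee to a summer's day? Thou art more lovely."
sentences = nltk.sent_tokenize(text)                 # list of sentences
words = [nltk.word_tokenize(s) for s in sentences]   # list of token lists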
""" read text """
with open('shakespeare.txt', 'r', encoding='utf-8') as shakespeare_read:
    # read(n) method will put n characters into a string
    shakespeare_string = shakespeare_read.read()
""" remove stop words """

STOPWORDS = ["an", "a", "the", "or", "and", "thou", "must", "that", "this", "self", "unless", "behind", "for", "which",
             "whose", "can", "else", "some", "will", "so", "from", "to", "by", "within", "of", "upon", "th", "with",
             "it"]

def _remove_stopwords(txt):
    """Delete from txt all words contained in STOPWORDS."""
    words = [word for word in txt.split() if word not in STOPWORDS]
    return " ".join(words)
""" create a list of sentences """
import re
from collections import Counter

doc_out = []
for k in shakespeare_split:
    cleantextprep = str(k)
        # Regex cleaning
    expression = "[^a-zA-Z ]"  # keep only letters, numbers and whitespace
    cleantextCAP = re.sub(expression, '', cleantextprep)  # apply regex
    cleantext = cleantextCAP.lower()  # lower case
    cleantext = _remove_stopwords(cleantext)
    bound = ''.join(cleantext)
    doc_out.append(bound)       # a list of sentences
""" decompose a list of sentences into tokens(words) """
import nltk

def decompose_word(doc):
    txt = []
    for word in doc:
        txt.extend(word.split())
    return txt

# decompose a list of sentences into words by self-defined function
tokens = decompose_word(doc_out)
# decompose a list of sentences into words from NLTK module
tokens_nltk = nltk.word_tokenize(str(doc_out))
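The section heading also promises tagging; a minimal part-of-speech tagging sketch with NLTK (assuming the averaged_perceptron_tagger resource has been downloaded; the tags shown are only illustrative):

import nltk
# nltk.download('averaged_perceptron_tagger')  # one-time download

sample = nltk.word_tokenize("Shall I compare thee to a summer's day")
print(nltk.pos_tag(sample))
# e.g. [('Shall', 'MD'), ('I', 'PRP'), ('compare', 'VB'), ('thee', 'PRP'), ...]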

Stopwords

""" removing stopwords """
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# nltk.download('stopwords')  # one-time download of the stopword lists

example_sent = "This is a sample sentence, showing off the stop words filtration."
# stop words in English
stop_words = set(stopwords.words('english'))
# stop words in German
stop_words_German = set(stopwords.words('german'))
# stop words in Italian
stop_words_italian = set(stopwords.words('italian'))

word_tokens = word_tokenize(example_sent)
# compact syntax
filtered_sentence = [w for w in word_tokens if w not in stop_words]
# standard syntax
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

Lemmatization

""" removing stopwords """
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
# rocks : rock
print("corpora :", lemmatizer.lemmatize("corpora"))
# corpora : corpus
# pos="a" tells the lemmatizer to treat the word as an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))
# better : good
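NLTK also ships rule-based stemmers (mentioned in the toolkit overview above); a quick comparison sketch showing why lemmatization is often preferred:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# stemming clips suffixes by rule and may produce non-words;
# lemmatization maps tokens to valid dictionary forms
print(stemmer.stem("studies"))          # studi
print(lemmatizer.lemmatize("studies"))  # study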

Sentiment / Textual analysis

How do we detect the tone or sentiment of a text? Using a predefined lexicon that classifies words by semantic polarity (positive versus negative), we screen the tokens and count the words matched by the lexicon. If positive words occur more frequently than negative words, we infer an optimistic tone/sentiment in the text.

  • We employ an opinion lexicon to identify text polarity. The lexicon-based approach to opinion mining depends on opinion (or sentiment) words, which are words that express positive or negative sentiments. Words that encode a desirable state (e.g., "great" and "good") have a positive polarity, while words that encode an undesirable state (e.g., "bad" and "awful") have a negative polarity. Although opinion polarity normally applies to adjectives and adverbs, there are verb and noun opinion words as well.

  • The lexicon contains two classes (positive/negative) and 10 specific lists.
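As a toy illustration of the counting idea (the mini-lexicons below are made up for the example, not taken from the lecture's word lists):

POS = {"good", "great", "lovely"}
NEG = {"bad", "awful", "cruel"}

toy_tokens = "the day was lovely but the ending was awful and bad".split()
pos_hits = sum(t in POS for t in toy_tokens)   # 1
neg_hits = sum(t in NEG for t in toy_tokens)   # 2
print("negative tone" if neg_hits > pos_hits else "positive tone")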


""" Import lexicon for scrutify """
# Negative lexicon
ndct = ''
with open('bl_negative.csv', 'r', encoding='utf-8', errors='ignore') as infile:
    for line in infile:
        ndct = ndct + line

# create a list of negative words, dropping empty entries
ndct = [entry for entry in ndct.split('\n') if entry]
len(ndct)    # 4783 negative words

# Positive lexicon
pdct = ''
with open('bl_positive.csv', 'r', encoding='utf-8', errors='ignore') as infile:
    for line in infile:
        pdct = pdct + line

# create a list of positive words, dropping empty entries
pdct = [entry for entry in pdct.split('\n') if entry]
len(pdct)  # 2009 positive words
""" screen tokens that are in the selected lexicon """

# list of negative words and their frequency
nwc = wordcount(tokens, ndct)   # wordcount(text,lexicon)
# [['die', 3], ['famine', 1], ['lies', 2], ['foe', 1], ['cruel', 1], ['gaudy', 1], ['waste', 2], ['pity', 1], ['besiege', 1], ['tattered', 1], ['weed', 1], ['sunken', 1], ['shame', 3], ['excuse', 1], ['cold', 1], ['beguile', 1], ['wrinkles', 1], ['dies', 1], ['abuse', 1], ['deceive', 1], ['hideous', 1], ['sap', 1], ['frost', 1], ['prisoner', 1], ['bereft', 1], ['ragged', 1], ['forbidden', 1], ['death', 1], ['burning', 1], ['weary', 1], ['feeble', 1], ['sadly', 1], ['annoy', 1], ['offend', 1], ['chide', 1], ['wilt', 2], ['fear', 1], ['wail', 1], ['weep', 1], ['deny', 1], ['hate', 2], ['conspire', 1]]

# list of positive words and their frequency
pwc = wordcount(tokens, pdct)
# [['tender', 2], ['bright', 1], ['abundance', 1], ['sweet', 5], ['fresh', 2], ['spring', 1], ['proud', 1], ['worth', 1], ['beauty', 7], ['treasure', 3], ['praise', 2], ['fair', 3], ['proving', 1], ['warm', 1], ['fond', 1], ['lovely', 2], ['golden', 2], ['loveliness', 1], ['free', 1], ['beauteous', 2], ['great', 1], ['gentle', 2], ['work', 1], ['fairly', 1], ['excel', 1], ['leads', 1], ['willing', 1], ['happier', 2], ['gracious', 2], ['homage', 1], ['majesty', 1], ['heavenly', 1], ['strong', 1], ['adore', 1], ['like', 2], ['joy', 2], ['gladly', 1], ['pleasure', 1], ['sweetly', 1], ['happy', 1], ['pleasing', 1], ['well', 1], ['enjoys', 1], ['love', 4], ['beloved', 1]]

# Total number of positive/negative words
ntot, ptot = 0, 0
for i in range(len(nwc)):
    ntot += nwc[i][1]

for i in range(len(pwc)):
    ptot += pwc[i][1]
""" Print results """

print('Positive words:')
for i in range(len(pwc)):
    print(str(pwc[i][0]) + ': ' + str(pwc[i][1]))
print('Total number of positive words: ' + str(ptot))
print('\n')
print('Percentage of positive words: ' + str(round(ptot / nwords, 4)))
print('\n')
print('Negative words:')
for i in range(len(nwc)):
    print(str(nwc[i][0]) + ': ' + str(nwc[i][1]))
print('Total number of negative words: ' + str(ntot))
print('\n')
print('Percentage of negative words: ' + str(round(ntot / nwords, 4)))
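A single summary statistic is often convenient; one common choice (a sketch, not part of the lecture code) is the net polarity, which is positive when the optimistic tone dominates:

# net polarity in [-1, 1]; guard against texts with no lexicon hits
polarity = (ptot - ntot) / (ptot + ntot) if (ptot + ntot) > 0 else 0.0
print('Net polarity: ' + str(round(polarity, 4)))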

Create wordcloud

A demonstration of a word cloud generated from the Shakespeare text:


""" import wordcloud from python 3.7 envi """

from wordcloud import WordCloud # using python 3.7

comment_words = ' '
for token in tokens:
    comment_words = comment_words + token + ' '

wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                min_font_size = 10).generate(comment_words)
""" plot wordcloud """

import matplotlib
matplotlib.use("TkAgg")  # may be needed on macOS to select a GUI backend
from matplotlib import pyplot as plt

plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.savefig("wordcloud.png", format='png', dpi=200)
plt.show()
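WordCloud can also write the image to disk directly, without going through matplotlib:

wordcloud.to_file("wordcloud.png")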

Additional Resources
